Agenda

■ Motivation and IA-64 feature overview

■ IA-64 features
  • EPIC
  • Data types, memory and registers
  • Register stack
  • Predication and parallel compares
  • Software pipelining and register rotation
  • Control & data speculation
  • Branch architecture
  • Integer architecture
  • Floating point architecture

■ Itanium™ processor overview
■ Itanium™ processor based systems overview
■ Operating systems, tools and programming

Copyright © 2000, Intel Corporation. All rights reserved.


*Other brands and names are the property of their respective owners 


IA-64: Extending the Intel® Architecture

■ Designed for High Performance Computing
  • Scientific
  • Technical & Engineering
  • Business

■ New EPIC Technology

■ IA-64 Architecture uses EPIC

■ Itanium™ processor is the first implementation of IA-64

Performance, scalability, availability, compatibility
Performance Limiters

■ Parallelism not fully utilized
  • Existing architectures cannot exploit sufficient parallelism in integer code to feed a wide in-order implementation

■ Branches
  • Even with perfect branch prediction, small basic blocks of code do not fully utilize machine width

■ Procedure Calls
  • Software modularity is becoming standard, resulting in call/return overhead

■ Memory latency and address space
  • Increasing relative to processor cycle time (larger cache miss penalties) and limited address space

IA-64 overcomes these limitations, and more!
IA-64 Architectural Features

■ 64-bit Address Flat Memory Model
■ Explicit Parallel Instruction Computing
■ Large Register Files
■ Automatic Register Stack Engine
■ Predication
■ Software Pipelining Support
■ Register Rotation
■ Sophisticated Branch Architecture
■ Loop Control Hardware
■ Control & Data Speculation
■ Cache Control
■ Powerful Integer Architecture
■ Advanced Floating Point Architecture
■ Multimedia Support (MMX™ Technology)
Next Generation Architecture
EPIC Design Philosophy

■ Maximize performance via hardware & software synergy
■ Advanced features enhance instruction level parallelism
  • Predication, Speculation, ...
■ Massive hardware resources for parallel execution

[Figure: chart with a Performance axis]

Beyond Traditional Architectures
EPIC Instruction Parallelism

[Figure: Source Code is turned into Instruction Groups, which are packed into Instruction Bundles]

■ Instruction Groups (series of bundles)
  • No RAW or WAW dependencies
  • Issued in parallel depending on resources

■ Instruction Bundles
  • 3 instructions + template
  • 3 x 41 bits + 5 bits = 128 bits
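The bundle arithmetic above (3 x 41 bits + 5 bits = 128 bits) can be sketched as a small packing routine. This is an illustrative sketch only: the helper names `pack_bundle` and `unpack_bundle` are hypothetical, and the layout assumed here places the template in the low 5 bits with the three slots above it.

```python
SLOT_BITS = 41       # width of one instruction slot
TEMPLATE_BITS = 5    # width of the template field
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer.

    Assumed layout: template in bits 0-4, slot 0 in bits 5-45,
    slot 1 in bits 46-86, slot 2 in bits 87-127.
    """
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instruction slots")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots
```

Packing the largest possible template and slots fills exactly 128 bits, which is why bundles can always be fetched as aligned 16-byte units.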
2x32-bit SIMD Integer
4x16-bit SIMD Integer
8x8-bit SIMD Integer
64-bit DP F.P.
2x32-bit SIMD SP-F.P.
64-bit Memory Access

■ 18 BILLION Giga Bytes accessible
  • 2^64 = 18,446,744,073,709,551,616
■ Byte addressable access with 64-bit pointers
  • 64-bit virtual address space
  • HW support for 32-bit pointers
■ Access granularity and alignment
  • 1, 2, 4, 8, 10, 16 bytes
  • Alignment on naturally aligned boundaries is recommended
  • Instructions are always 16-byte aligned
■ Support for both Big and Little endian byte order
■ Memory hierarchy control
• 2.1 GB/s front-side bus

Byte Addressable 64-bit Virtual Address-Space
Memory Hierarchy Control

■ Software can explicitly control memory accesses
  • Specify levels of the memory hierarchy affected by the access
  • Allocation and Flush resolution is at least 32 bytes

■ Allocation (Prefetch)
  • Allocation implies bringing the data close to the CPU
  • Allocation hints indicate at which level allocation takes place
  • Used in load, store, and explicit pre-fetch instructions

■ De-allocation and Flush
  • Invalidates the addressed line in all levels of cache hierarchy
  • Writes data back to memory if necessary

■ Three levels of cache (full speed L2 cache, 2/4MB L3 cache)
■ Atomic operation support

IA-64 Control over Cache (De)Allocation
Memory Access Ordering

■ Explicit control
  • Memory fence (mf): ensures all prior memory operations are seen prior to all future memory operations
  • Acquire load (ld.acq): ensures the load is seen prior to all future memory operations
  • Release store (st.rel): ensures all prior memory operations are seen prior to the store
  • Synchronize instruction caches (sync.i): ensures all instruction caches have seen all prior flush cache instructions

■ Implicit - applicable to semaphore instructions
  • xchg: exchange memory and General Register (GR)
  • cmpxchg: conditional exchange of memory and GR
  • fetchadd: add immediate to memory

■ Strong ordering model is compatible with IA-32 ordering
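The ordering rules for mf, ld.acq and st.rel can be expressed as a small predicate over pairs of operations. The sketch below is a simplified model, not a description of any implementation: `MemOp` and `may_pass` are hypothetical names, and same-address dependencies are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    name: str
    kind: str  # "plain", "acquire", "release" or "fence"


def may_pass(earlier, later):
    """Return True if `later` may become visible before `earlier`.

    `earlier` precedes `later` in program order. Simplified model:
    only the ordering semantics are considered, not address overlap.
    """
    if earlier.kind == "fence" or later.kind == "fence":
        return False  # mf: nothing crosses in either direction
    if earlier.kind == "acquire":
        return False  # ld.acq is seen before all later operations
    if later.kind == "release":
        return False  # st.rel is seen after all earlier operations
    return True
```

Under this model a plain load that follows a release store may still move ahead of it, which is the freedom that acquire and release semantics leave to the hardware compared with a full fence.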


[Figure: IA-64 register files]

Register file         Range        Static   Rotating
Integer Registers     GR0-GR127    32       96 (framed, rotating)
FP Registers          FR0-FR127    32       96
Branch Registers      BR0-BR7
Predicate Registers   PR0-PR63     16       48
Context Switch

■ “Normal”
  - full context switch saves all registers
  - GR and FR

■ “Lazy”
  - saves only GR registers or specific range (GR0-31, GR32-GR127)

■ “Fast”
  - doesn’t save any registers and uses 16 separate banked “shadow” registers (OS only, GR16-GR31)
  - e.g. interrupt and exception handling
Register Stack

■ GRs 0-31 are global to all procedures
■ Stacked registers begin at GR32 and are local to each procedure
■ Each procedure’s register stack frame varies from 0 to 96 registers
■ Only GRs implement a register stack
  • The FRs, PRs, and BRs are global to all procedures
■ Register Stack Engine (RSE)
  • Upon stack overflow/underflow, registers are saved/restored to/from a backing store transparently

IA-64: Optimized CALL/RETURN and Parameter Passing
Register Stack In Work

■ Call changes frame to contain only the caller's output
■ Alloc instr. sets the frame region to the desired size
  • Three architecture parameters: local, output, and rotating
■ Return restores the stack frame of the caller

[Figure: virtual register frames of PROC A and PROC B, starting at GR32, across Call, Alloc and Ret]

IA-64 Avoids Register Spill/Fill among Procedure Calls
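The call / alloc / return sequence above can be modeled with a few lines of bookkeeping. This is a minimal sketch under stated assumptions: the class and field names (`RegisterStack`, `Frame`, `sol`, `sof`) are hypothetical, input registers are counted together with locals, and the backing store handled by the RSE is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    base: int  # physical index that this procedure sees as GR32
    sol: int   # size of the input + local region
    sof: int   # total frame size (inputs + locals + outputs)


class RegisterStack:
    MAX_FRAME = 96  # a frame holds between 0 and 96 stacked registers

    def __init__(self):
        self.current = Frame(base=0, sol=0, sof=0)
        self.saved = []

    def alloc(self, inputs, local_regs, outputs):
        """Resize the current frame, as the alloc instruction does."""
        sof = inputs + local_regs + outputs
        if sof > self.MAX_FRAME:
            raise ValueError("frame exceeds 96 registers")
        self.current.sol = inputs + local_regs
        self.current.sof = sof

    def call(self):
        """On a call the new frame contains only the caller's outputs."""
        c = self.current
        self.saved.append(Frame(c.base, c.sol, c.sof))
        self.current = Frame(base=c.base + c.sol, sol=0, sof=c.sof - c.sol)

    def ret(self):
        """Return restores the caller's frame."""
        self.current = self.saved.pop()

    def physical(self, gr):
        """Map a virtual register number (GR32 and up) to a physical index."""
        index = gr - 32
        if not 0 <= index < self.current.sof:
            raise IndexError("register is outside the current frame")
        return self.current.base + index
```

Because the caller's first output register and the callee's GR32 map to the same physical register, parameters are passed without any copy or spill.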
Register Stack Engine

[Figure: stacked registers being allocated and released]

IA-64 GR (Integer) Registers only
Predication: Control Flow to Data Flow

[Figure: Traditional Arch. uses cmp a, b followed by a branch to the "then" (x=1) or "else" (x=2) block; IA-64 uses cmp a, b -> p1, p2 and predicates the two assignments]

■ Conditional execution based on qualifying predicate
■ 64 predicate registers
■ Can be combined with logical operations

IA-64 Removes/Reduces Branches and Enables Parallel Execution
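The conversion from control flow to data flow can be illustrated by writing the same if/else twice. This is only a sketch: the function names are hypothetical, and the assignment of x=1 to the "then" path is an assumption, since the figure does not survive clearly.

```python
def select_branching(a, b):
    """Traditional form: the compare feeds a branch."""
    if a == b:
        return 1
    return 2


def select_predicated(a, b):
    """If-converted form: no branch, both assignments are issued."""
    # cmp a, b -> p1, p2: one compare writes a predicate and its complement.
    p1 = a == b
    p2 = not p1
    x = None
    # (p1) x = 1 and (p2) x = 2: each assignment takes effect only when
    # its qualifying predicate is true, so the two can issue in parallel.
    x = 1 if p1 else x
    x = 2 if p2 else x
    return x
```

Both versions compute the same result, but the predicated form has no branch to mispredict.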
Predication ...

■ Unpredictable branches removed
  - Misprediction penalties eliminated

■ Basic block size increases
  - Compiler has a larger scope to find ILP

■ ILP within the basic block increases
  - Both “then” and “else” executed in parallel

■ Wider machines are better utilized

Predication Enables and Enhances ILP
Parallel Compares

■ Three new types of compares:
  • AND: both target predicates set FALSE if compare is false
  • OR: both target predicates set TRUE if compare is true
  • ANDOR: if true, sets one TRUE, sets other FALSE

IA-64 Reduces Critical Path
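The three compare types only update their targets in one direction, which is what lets several compares write the same predicate in the same cycle. The sketch below models that behaviour; the function names are hypothetical, and the "otherwise unchanged" behaviour is an assumption drawn from the wording of the bullets.

```python
from itertools import permutations


def cmp_and(result, p1, p2):
    """AND type: clear both targets if the compare is false, else unchanged."""
    return (p1, p2) if result else (False, False)


def cmp_or(result, p1, p2):
    """OR type: set both targets if the compare is true, else unchanged."""
    return (True, True) if result else (p1, p2)


def cmp_andor(result, p1, p2):
    """ANDOR type: if true, set one target TRUE and the other FALSE."""
    return (True, False) if result else (p1, p2)


def all_true(conditions):
    """Evaluate c1 && c2 && ... with AND-type compares on one predicate."""
    p, q = True, True  # targets are initialized before the compares
    for c in conditions:
        p, q = cmp_and(c, p, q)
    return p


def order_independent(conditions):
    """Check that every ordering of the compares yields the same predicate."""
    return len({all_true(perm) for perm in permutations(conditions)}) == 1
```

Since the result does not depend on the order of the compares, a chain such as `a && b && c` can be evaluated in one step instead of three dependent ones, which is how the critical path shrinks.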
Software Pipelining

[Figure: Sequential Loop versus Software-Pipelined Loop over Time; labels include compute and store]

■ Traditional architectures use loop unrolling
  - Results in code expansion and increased cache misses
■ IA-64 Software Pipelining uses rotating registers
  - Allows overlapping execution of multiple loop instances
■ Predication controls the pipeline stages

IA-64 Provides Direct Support for Software Pipelining
Software Pipelining

■ IA-64 features that make this possible
  - Full predication to define pipeline stages
  - Special branch handling features
  - Loop branches
  - Special loop registers (LC, EC)
  - Register rotation: removes loop copy overhead
  - Predicate rotation: removes prologue & epilog

■ Traditional architectures use loop unrolling
  - High overhead: extra code for loop body, prologue and epilog
  - Consumes a large number of registers
Software Pipelining

[Figure: values of the stage predicates p17, p18 and p19 during the Prologue, Kernel and Epilogue; the predicates gate the compute and store stages]
Software Pipelining

■ IA-64 Features
  • Rotating Registers
  • Loop branches
  • Full predication
  • Rotating Predicates

Kernel-only code using stage predicates

[Figure: loop schedule divided into Prologue, Kernel and Epilogue]
Register Rotation

■ GR32-127 and FR32-127 can rotate (specified range)
■ Separate rotating register base for each set (GR, FR)
■ Loop branches decrement all register rotating bases (RRB)
■ Instructions contain a “virtual” register number
  • physical register # = RRB + virtual register #

[Figure: physical registers 32 to 36]

Predicate register range also rotates.
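The mapping "physical register # = RRB + virtual register #" can be tried out with a small model. The sketch assumes the sum wraps around inside the rotating region, and the class name `RotatingFile` and its methods are hypothetical.

```python
class RotatingFile:
    """Toy model of a rotating register region that starts at register 32."""

    FIRST = 32

    def __init__(self, size=96):
        self.size = size
        self.rrb = 0                 # rotating register base
        self.phys = [0] * size       # physical storage

    def _index(self, virtual):
        # physical = RRB + virtual, wrapped inside the rotating region
        return (virtual - self.FIRST + self.rrb) % self.size

    def write(self, virtual, value):
        self.phys[self._index(virtual)] = value

    def read(self, virtual):
        return self.phys[self._index(virtual)]

    def rotate(self):
        """What a loop branch does: decrement the register rotating base."""
        self.rrb = (self.rrb - 1) % self.size
```

A value written to register 32 in one iteration is read as register 33 in the next iteration and as register 34 in the one after that, so no copy instructions are needed to carry values between pipeline stages.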
■ Control Speculation: moves loads above branches / calls
■ Data Speculation: moves loads above possibly conflicting stores

IA-64 Speculation reduces the impact of memory latency
Control Speculation

[Figure: Traditional Arch. versus IA-64; labels: Barrier, Detect exception, Deliver exception]

■ Control Speculation moves loads above branches
  • Detected exception indicated using NaT bit / NaT Val
■ Check raises detected exceptions
■ Branch barrier broken to minimize memory latency
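The deferred-exception mechanism can be sketched with a sentinel value standing in for the NaT bit. Everything here is a simplified illustration: `NAT`, `ld_s`, `add` and `chk_s` are hypothetical stand-ins, and a missing dictionary key plays the role of a faulting address.

```python
NAT = object()  # stands in for a register whose NaT bit is set


def ld_s(memory, addr):
    """Speculative load: a fault is recorded as NaT instead of being raised."""
    return memory.get(addr, NAT)


def add(a, b):
    """Computation propagates NaT from either operand to the result."""
    return NAT if a is NAT or b is NAT else a + b


def chk_s(value, recovery):
    """Check: run the recovery code only if the speculation failed."""
    return recovery() if value is NAT else value


def guarded_sum(memory, addr, use_value):
    """Model of: if (use_value) result = mem[addr] + 1, with the load hoisted."""
    r1 = ld_s(memory, addr)      # hoisted above the branch
    r2 = add(r1, 1)              # speculative use, also hoisted
    if not use_value:
        return None              # branch not taken: the NaT is never checked
    # Recovery re-executes the load non-speculatively; a real fault would
    # surface here (as a KeyError in this model).
    return chk_s(r2, lambda: memory[addr] + 1)
```

When the guarded path is not executed, a failed speculative load costs nothing and raises nothing, which is what allows the load to start early.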
Hoisting Uses

[Figure: Traditional Arch. versus IA-64; the load and its use are hoisted above the branch barrier, and the check (chk.s) branches to recovery code that repeats the load and the use]

■ All computation instructions propagate NaTs to reduce the number of checks, allowing a single check on results
■ Compares also propagate NaTs when writing predicates
Data Speculation

[Figure: Traditional Arch. versus IA-64]

■ Data Speculation moves loads above possibly conflicting stores
  • Keeps track of load addresses used in advance (ALAT)
■ Advanced-loaded data can be used speculatively
Hoisting Uses

[Figure: Traditional Arch. versus IA-64; labels: Speculative use, store Barrier, Recovery code, branch]

Data and Control Speculation Can be combined
Advanced Load Address Table - ALAT

■ ld.a inserts entries
■ Conflicting stores remove entries
  • also ld.c.clr, chk.a.clr
■ Presence of entry indicates success
  • chk.a branches when no entry is found

[Figure: ld.a reg# = ... inserts a (reg#, addr) entry into the ALAT; st [addr] is compared against the stored addresses; chk.a reg# looks up the entry by reg#]
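The insert / remove / check protocol of the ALAT can be sketched with a dictionary keyed by register number. This is a simplified model: the class `Alat` and its method names are hypothetical, entries match on exact address only, and capacity limits are ignored.

```python
class Alat:
    """Toy Advanced Load Address Table: register number -> load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, reg, addr, memory):
        """Advanced load: read memory and insert an entry for `reg`."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, addr, value, memory):
        """Store: write memory and remove every entry with a matching address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def chk_a(self, reg):
        """Check: True if the entry survived, i.e. the speculation succeeded."""
        return reg in self.entries


def load_after_store(memory, load_addr, store_addr, value):
    """Model of: mem[store_addr] = value; r = mem[load_addr], load hoisted."""
    alat = Alat()
    r = alat.ld_a(8, load_addr, memory)   # hoisted above the store
    alat.store(store_addr, value, memory)
    if not alat.chk_a(8):
        r = memory[load_addr]             # recovery: redo the load
    return r
```

When the two addresses differ, the hoisted load result is used directly; when they alias, the missing entry sends execution to the recovery path and the correct value is reloaded.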
Branch Architecture

■ Branch types
  • IP-offset branches (21-bit disp.)
  • Indirect branches via 8 branch registers
  • HW-supported counted loop control instr.

■ Branch Predict hints
  • Advance information on downstream branches and branch conditions
  • Branch hints can be static or dynamic

■ Multi-way branches
  • Bundle 1-3 branches in a bundle
  • Allow multiple bundles to participate

Aggressive branch prediction, decoupled front end with code prefetch

IA-64 Branch hints reduce misprediction and overhead
Traditional Architecture:

        compute0
        compute1
        compute2
        compare (a==0)
B:      branch_if_eq -> Target

Target:

IA-64 Architecture:

        hint B, Target (early hint)
        compute0
        compute1
        compute2
        compare (a==0)
B:      branch_if_eq -> Target

Target:
Integer Architecture

■ 128 general registers (64 bit)
■ Full 64-bit support (as well as 8-16-32-bit)
■ XMA: Integer Multiply-Add instruction (l = i * j + k)
■ Integer multiply is executed in the floating-point unit
■ Data transfer
  - load, store, GR <-> FR conversion
■ SIMD Integer operations
■ Divide / remainder deferred to software
  - Based on floating-point operations
  - High throughput achieved via pipelining

■ Up to 4 Integer/ALU operations per clock

IA-64: Excellent Server & Security Application Performance
IA-64 SIMD - Integer

■ Exploits data parallelism with SIMD (Single Instruction Multiple Data)
■ Performance boost for audio, video, imaging, streaming etc. functions
■ GRs treated as 8x8, 4x16, or 2x32 bit elements
■ Several instruction types
  • Addition and subtraction, multiply
  • Pack/Unpack
  • Left shift, signed/unsigned right shift
■ Compatible with Intel® MMX™ Technology

[Figure: a 64-bit register divided into 8x8, 4x16, or 2x32 bit elements]

IA-64 Performance Boost for all Data Parallel Apps
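Treating a 64-bit register as four independent 16-bit elements can be sketched as follows. The helper name `packed_add_4x16` is hypothetical, and wrap-around (modular) arithmetic is assumed for each lane.

```python
LANE_BITS = 16
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1


def packed_add_4x16(a, b):
    """Add two 64-bit values as four independent 16-bit lanes.

    Each lane wraps around on overflow, and no carry crosses from one
    lane into the next.
    """
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        lane_sum = ((a >> shift) + (b >> shift)) & LANE_MASK
        result |= lane_sum << shift
    return result
```

For example, adding `0x0001_0001_0001_0001` to `0x0001_FFFF_0002_0003` gives `0x0002_0000_0003_0004`: the lane holding `0xFFFF` wraps to zero without disturbing its neighbour, which is the difference from an ordinary 64-bit add.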
■ Fused Multiply Add Operation
  - An efficient core computation unit
  - Greater precision, faster than independent multiply and add
■ Abundant Register resources
  - 128 registers (32 static, 96 rotating)
■ High Precision Data computations
  - 82-bit unified internal format for all data types
  - Full IEEE 754 support
■ Software divide/square-root
  - High throughput achieved via pipelining
■ Wide (Speculative) Memory Access
  - Dual Load-Pair support
■ Address memory latency
  - Pre-fetch support

■ 2 independent FP Units
■ Up to 4 DP FP operations per clock
■ Up to 4 DP FP operands loaded per clock (from L2 cache)

IA-64: Excellent Workstation & HPC Application Performance
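The "greater precision" claim for fused multiply-add comes from rounding once instead of twice. The sketch below emulates a fused operation with exact rational arithmetic; `fused_multiply_add` and `separate_multiply_add` are hypothetical helper names, and double precision is assumed throughout.

```python
from fractions import Fraction


def separate_multiply_add(a, b, c):
    """Independent multiply and add: the product is rounded, then the sum."""
    return (a * b) + c


def fused_multiply_add(a, b, c):
    """Emulated fused multiply-add: a*b + c is computed exactly, rounded once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Operands chosen so that the exact product 1 - 2**-60 rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0
```

With these operands the separate version returns `0.0`, because the product is rounded to `1.0` before the add, while the fused version keeps the small term and returns `-2**-60`.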
IA-64 SIMD - F.P.

■ Exploits data parallelism with SIMD (Single Instruction Multiple Data)
■ Up to 2x performance boost
■ F.P. Registers treated as two 32 bit single precision elements
  • Full IEEE 754 compliance
  • Availability of fast divide (non IEEE)
■ Compatible with Intel® Streaming SIMD Extensions (SSE)
■ Up to 8 SP FP operations per clock

[Figure: a 64-bit register holding 2x32 bit SP FP elements]

IA-64 Enables World Class 3D Graphics Performance
Floating-Point Status Register

[Figure: register layout with field widths of 13 and 6 bits. Four Sets for Parallelism & Speculation]

■ Contains dynamic control/status for FP operations
■ Trap/Fault disable bits
  • trap disables for IEEE exception events
  • trap disable “D” for denormal operand exception
■ 4 separate status fields -> 4 computational env.
  • Each field specifies precision/rounding mode, trap disables, flush to zero, widest range exponent
  • Each field reports sticky exception flags
Intel® Itanium™ Processor

■ IA-64 starts with Itanium processor
■ Platform with Intel® 460GX chipset
■ Solid progress following first silicon
  • More than 4 OS running today
  • Demonstrated real IA-64 Windows 2000 and Linux applications on real hardware
  • Engineering samples shipping to OEMs, IHVs and ISVs
■ Comprehensive validation underway

Leading-Edge Implementation of IA-64 for World-Class Performance

320M transistors: 25M in CPU, 295M in L3 cache

More and better Capacity & Capability
Extending Intel® Architecture

[Figure: roadmap of System Performance over time, '99 to '02, on .18µ and .13µ processes. IA-64: Itanium™, McKinley*, then Madison* and Deerfield* (price/perf). Also shown: Outstanding Performance for 32 Bit Volume Apps]

IA-64 Extends IA Headroom, Scalability and Availability for the Most Demanding Environments


All dates specified are target dates provided for planning purposes only and are subject to change. 
Itanium Processor Block Diagram

[Figure: block diagram; labels: L1 Instruction Cache and Fetch/Pre-fetch Engine, Instruction Queue (8 bundles), IA-32 Decode, 9 Issue Ports, Register Stack Engine, L2 Cache, L3 Cache, Controller]
Itanium™ Processor Features

■ Up to 6 instructions issued per clock
■ 9 instruction issue ports
■ 2 floating-point units

■ 4 integer units
■ 3 branch units
■ 3 levels of cache at full speed
■ L1 and L2 on-chip, L3 (2/4 MB) on cartridge
■ 10-stage in-order pipeline
Itanium Processor Memory Hierarchy

■ L1 Caches (on-chip)
  • Data Cache
    - 4-way, 32 byte cache lines
    - FP loads bypassed to L2
  • Instruction Cache
    - 4-way, 32 byte cache lines

■ L2 Cache (on-chip)
  • Unified instr. & data cache
    - 6-way, 64 byte cache lines

■ L3 Cache (on cartridge)
  • Full speed unified
    - 2/4 MBytes
    - 4-way, 64 byte cache lines

■ Memory
  • Frontside Bus
    - 2.1 GBytes/sec

[Figure: Itanium™ Processor Cartridge; 16 bytes/clk]
Itanium Processor Pipeline

■ 6-wide EPIC hardware under compiler control
  - Parallel hardware and control for predication & speculation
  - Efficient mechanism for enabling register stacking & rotation
  - Software-enhanced branch prediction

■ 10-stage in-order pipeline designed for:
  - Single cycle ALU (4 ALUs globally bypassed)
  - Low latency from data cache

■ Dynamic support for run-time optimization
  - Decoupled front end with prefetch to hide fetch latency
  - Non-blocking caches, register scoreboard to hide load latency
  - Aggressive branch prediction to reduce branch penalty
[Figure: 10-stage pipeline; stages: Inst. Pointer Generation, Fetch, Rotate, Expand, Rename, Word-line Decode, Register Read, Execute, Exception Detect, Write-back]

■ Execution
  • 4 single cycle ALUs, 2 ld/str
  • Predicate delivery and branch
  • NaT / Exception / Retirement

■ Operand Delivery
  • Register read + bypasses
  • Register scoreboard
  • Predicated dependencies
Selected Instruction Latencies

Instruction Class   Description
FMAC                Floating arithmetic
FMISC               Floating min, max, ...
IALU                Integer ALU
FLD, FLDP           FP load (L2 hit)
                    FP load (L3 hit)
LD                  Integer load (L1 hit)
                    Integer load (L2 hit)
                    Integer load (L3 hit)

[Table column: Latency (Cycles); only fragments of the values survive]

Itanium Processor Based Platform Features

■ 1-4 Itanium processor SMP system
■ High clock speed target 800 MHz
■ 6-way instruction issue and execution
■ Up to 64 GB SDRAM memory (460GX)
■ 4.2 GB/s memory bandwidth (peak)
■ 2.1 GB/s system bus
■ 2.1 GB/s I/O bandwidth (peak)
■ 1.0 GB/s AGP Pro graphics bus
■ 3.2 GFLOPS DP-F.P. peak perf. (6.4 in SP)

SHV Workstation platform: 2-way
SHV Server platform: 4-way
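The peak floating-point figures follow from the clock target and the per-clock operation counts quoted on the floating-point slides (up to 4 DP and up to 8 SP operations per clock). A short check, with `peak_gflops` as a hypothetical helper name:

```python
CLOCK_HZ = 800e6           # clock speed target: 800 MHz
DP_OPS_PER_CLOCK = 4       # up to 4 DP FP operations per clock
SP_OPS_PER_CLOCK = 8       # up to 8 SP FP operations per clock


def peak_gflops(ops_per_clock, clock_hz=CLOCK_HZ):
    """Peak rate in GFLOPS = operations per clock x clocks per second / 1e9."""
    return ops_per_clock * clock_hz / 1e9


dp_peak = peak_gflops(DP_OPS_PER_CLOCK)   # double precision
sp_peak = peak_gflops(SP_OPS_PER_CLOCK)   # single precision
```

This reproduces the 3.2 GFLOPS (DP) and 6.4 GFLOPS (SP) peak numbers; they are upper bounds that assume every floating-point slot is filled on every clock.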
[Figure: server platform block diagram; labels: Address Bus, SAC (Addr/Ctrl), Data, 64 SDRAM DIMMs (64GB max.), FWH, AUDIO, 4x PCI 64/66 (528MB/s), 264MB/s]
Workstation Platform

[Figure: workstation platform block diagram; labels: Itanium processors, Address Bus, Data Bus, Addr/Ctrl, Data, 16 SDRAM DIMMs (16GB max.), FWH, Audio, PCI 64/66 (528MB/s), 264MB/s]
HPC with Intel® Architecture: From Top to Bottom

[Figure: range of HPC systems; labels: Servers, Clusters, WS, SMP]
Itanium Processor Based System Designs

■ 32-way SMP
■ 16-way cc:NUMA
HPC Market Segment Is Changing

Open Industry Standards using Building Blocks
  - Windows NT*, Linux*
  - OpenMP*
  - MPI*, PVM*

Proprietary Solutions
IA-64 Operating Systems

[Figure: operating system logos, including Hewlett-Packard HP-UX*, Monterey UNIX*, Novell Modesto* and Trillian]

OSVs on track for Itanium™ processor
IA-64 Linux* (Trillian* Project)

■ Team includes VA Linux, IBM*, Intel, HP*, SGI*, Cygnus*, CERN*, Red Hat*, SuSE*, TurboLinux*, and Caldera*
■ Running applications
  • Demonstrated on Itanium™ processor system at IDF (8/99)
  • Major applications ported to date include Apache* and Sendmail
  • Development version release available
  • Full development OS releases from distributors available
■ Open source OS and compilers available
■ http://www.linuxia64.org
C/C++ Data Models

OS Implements the Data Models

■ ILP32
  - int, long and ptr are 32 bits
  - Used by 32-bit OSs
■ LP64
  - int is 32 bits
  - long and pointer are 64 bits
  - Used by 64-bit UNIX OSs
■ P64 (or LLP64)
  - int and long are 32 bits; pointer is 64 bits
  - Used by Win64* and Modesto*

* Third party names and brands are the property of their respective owners 
Intel® Compiler for IA-64

■ Compiler stages
  • C++ Front End, FORTRAN 90 Front End
  • Profile Guidance (PGOPT)
  • Interprocedural Optimizer (IPO)
  • High-Level Optimizer (HLO)
  • Independent Optimizer (ILO)
  • Code Generator (ECG)

■ Optimizations
  • Loop Unrolling
  • Register Variable
  • Constant Prop.
  • Strength Reduction
  • Copy Propagation
  • Redundancy Elim.
  • Dead Store Elim.
  • Dead Code Elim.

■ IA-64 code generation
  • Machine instruction lowering
  • Software Pipelining (rotating registers, loop branches)
  • Predication (parallel compares)
  • Global scheduling (control and data speculation, multi-way branches)
  • Block ordering (branch hints)
  • Global register allocation (register stack, ALAT, UNAT)
  • Function splitting (I cache and TLB locality)

IA-64 HPC Compilers & Tools

■ C/C++, Fxx, Java, ...
■ OpenMP
■ MPI, PVM
■ Performance Libraries
■ VTune, ...
IA-64 Application Benefits

Outstanding Performance
  • Removes performance bottlenecks
    - Large register files
    - High Parallelism
    - Predication
    - SW pipelining support
    - Memory latency hiding
  • 64-bits allows bigger address space
  • IEEE-accurate floating point
  • Ability to run IA-32 applications
IA-64 User Benefits

  • Big in-memory data structures and DB
  • Large file system and data files
  • Efficient large integer calculations
  • Fast 64-bit F.P. calculations
  • Fast Security processing
  • More and faster transactions
  • More services
    - Higher throughput
  • Improved availability and manageability
Intel: More Than Just Microprocessors

[Figure: Intel products and programs; labels include ISP Program, Performance Tuning, Intel® Create & Share™ Camera Pack]
IA-64: The Future for HPC

http://developer.intel.com/
http://developer.intel.com/design/ia64/
http://developer.intel.com/technology/itj/q41999.htm





Glossary

■ ALAT (Advanced Load Address Table) - cache used for data speculation which stores the most recent advanced load addresses

■ ALoad/ACheck - advanced load/check (Data Speculation)

■ Basic Block - code which is between two branches; if one instruction in the block of code executes, then all instructions in that block will also execute

■ Control Speculation - the execution of an operation before the branch which guards it; used to hide memory latency

■ Data Speculation - the execution of a memory load prior to a store that precedes it, and that may potentially alias it; used to hide memory latency



■ IA-32 - the name for Intel’s current ISA (32-bit and 16-bit)

■ IA-32 System Environment - the system environment of an IA-64 processor as defined by the Pentium® processor and Pentium® Pro processor

■ IA-64 - Intel® 64-bit Architecture is composed of the 64-bit ISA and IA-32; IA-64 integrates the two into a single architectural definition

■ IA-64 Firmware - the Processor Abstraction Layer and the System Abstraction Layer

■ IA-64 System Environment - IA-64 operating system with privileged resources along with capability to support the execution of existing IA-32 applications

■ Instruction Set Architecture (ISA) - defines application level resources which include: user-level instructions, addressing modes, segmentation, and user visible register files

■ NaT bit/NaT Value (Not a Thing) - used with control speculation to indicate that a number stored in a general or floating-point register is not valid

■ Predication - the conditional execution of an instruction; used to remove branches from code

■ Processor Abstraction Layer (PAL) - the IA-64 firmware layer which abstracts IA-64 processor features that are implementation dependent

■ SLoad/SCheck - speculative load/check (control speculation)

■ System Abstraction Layer (SAL) - the IA-64 firmware layer which abstracts IA-64 system features that are implementation dependent

■ System Environment - defines processor specific operating system resources which include: exception and interruption handling, virtual and physical memory management, system register state, and privileged instructions
Backup

SW Pipelined Loop Example

■ DAXPY inner loop: dy[i] = dy[i] + (da * dx[i])
  - 2 loads, 1 fma, 1 store / iteration
■ Machine assumptions
  - can do 2 loads, 1 store, 1 fma, 1 br / cycle
  - load latency of 2 clocks
  - fma latency of 1 clock (not realistic, but good for example)
■ Special Registers
  - LC: Loop Counter
Example: Pipeline

■ Each column represents 1 source iteration

[Figure: stages of each iteration: load dx,dy; dy + da * dx; store dy]
Example Code

        .rotf dx[3], dy[3], tmp[2]

        mov ar.lc = 3             // #iterations-1
        mov ar.ec = 4             // #stages
        mov pr.rot = 0x10000      // p16 = 1, other rotating predicates = 0

looptop:
(p16)   ldfd dx[0] = [dxsp],8
(p16)   ldfd dy[0] = [dysp],8
(p18)   fma.d tmp[0] = da, dx[2], dy[2]
(p19)   stfd [dydp] = tmp[1],8
        br.ctop looptop
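The behaviour of the example loop can be followed in a small simulation of LC, EC, the stage predicates p16 to p19 and the rotating registers. This is a sketch under stated assumptions, not a cycle-accurate model: `daxpy_pipelined` is a hypothetical function, each `.rotf` array is modeled as its own small rotating buffer, and the loop branch is modeled as "LC > 0: decrement LC and feed a 1 into p16; else EC > 1: decrement EC and feed a 0; else fall through".

```python
from collections import deque


def daxpy_pipelined(da, x, y, stages=4):
    """Simulate the kernel-only DAXPY loop; return (result, loop executions)."""
    n = len(x)
    out = list(y)
    if n == 0:
        return out, 0
    lc, ec = n - 1, stages                     # mov ar.lc / mov ar.ec
    p = deque([True, False, False, False])     # p16..p19, from pr.rot
    dx = deque([0.0] * 3)                      # .rotf dx[3]
    dy = deque([0.0] * 3)                      # .rotf dy[3]
    tmp = deque([0.0] * 2)                     # .rotf tmp[2]
    load_i = store_i = executions = 0
    while True:
        executions += 1
        if p[0]:                               # (p16) ldfd dx[0], dy[0]
            dx[0], dy[0] = x[load_i], y[load_i]
            load_i += 1
        if p[2]:                               # (p18) fma.d tmp[0]
            tmp[0] = da * dx[2] + dy[2]
        if p[3]:                               # (p19) stfd tmp[1]
            out[store_i] = tmp[1]
            store_i += 1
        # br.ctop: update LC/EC and choose the value rotated into p16
        if lc > 0:
            lc, new_p16, taken = lc - 1, True, True
        elif ec > 1:
            ec, new_p16, taken = ec - 1, False, True
        else:
            ec, new_p16, taken = 0, False, False
        # rotate predicates and registers by one position
        p.rotate(1)
        p[0] = new_p16
        dx.rotate(1)
        dy.rotate(1)
        tmp.rotate(1)
        if not taken:
            return out, executions
```

For 4 iterations (LC=3, EC=4) the loop body executes 7 times: 4 executions while LC counts down fill the pipeline, and 3 more while EC counts down drain it.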
Loop Execution

[Figure: animated Execution Sequence. Each frame adds one row for one execution of the loop body: (p16) ld, (p16) ld, (p18) fma, (p19) st. The LC and EC values shown with each frame are listed below.]

Rows shown   LC   EC
1            3    4
2            2    4
3            1    4
4            0    4
5            0    3
6            0    2
7            0    1
7            0    0    (fall through)